A study of voice activity detection techniques for NIST speaker recognition evaluations

نویسندگان

  • Man-Wai Mak
  • Hon-Bill Yu
چکیده

Since 2008, interview-style speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has lower signal-to-noise ratio, which necessitates robust voice activity detectors (VADs). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech segmentation in these files. To overcome these difficulties, this paper proposes using speech enhancement techniques as a preprocessing step for enhancing the reliability of energy-based and statisticalmodel-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. The proposed VAD is compared with the ASR transcripts provided by NIST, VAD in the ETSI-AMR Option 2 coder, satistical-model (SM) based VAD, and Gaussian mixture model (GMM) based VAD. Experimental results based on the NIST 2010 SRE dataset suggest that the proposed VAD outperforms these conventional ones whenever interview-style speech is involved. This study also demonstrates that (1) noise reduction is vital for energy-based VAD under low SNR; (2) the ASR transcripts and ETSI-AMR speech coder do not produce accurate speech and non-speech segmentations; and (3) spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD. The segmentation files produced by the proposed VAD can be found in http://bioinfo.eie.polyu.edu.hk/ssvad.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparison of Voice Activity Detectors for Interview Speech in NIST Speaker Recognition Evaluation

Interview speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has substantially lower signal-to-noise ratio, which necessitates robust voice activity detection (VAD). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/nonspeech segmentat...

متن کامل

The IIR NIST SRE 2008 and 2010 summed channel speaker recognition systems

This paper reports the IIR speaker recognition system for the summed channel evaluation tasks in the NIST SRE 2008 and 2010. The system includes three main modules: voice activity detection, speaker diarization and speaker recognition. The front-end process employs a voice activity detection algorithm for effective speech frame selection. The speaker diarization system that was developed for 20...

متن کامل

Study of Overlapped Speech Detection for NIST SRE Summed Channel Speaker Recognition

This paper studies the overlapped speech detection for improving the performance of the summed channel speaker recognition system in NIST Speaker Recognition Evaluation (SRE). The speaker recognition system includes four main modules: voice activity detection, speaker diarization, overlapped speaker detection and speaker recognition. We adopt a GMM based overlapped speaker detection system, by ...

متن کامل

Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation

The introduction of interview speech in recent NIST Speaker Recognition Evaluations (SREs) has necessitated the development of robust voice activity detectors (VADs) that can work under very low signal-to-noise ratio. This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties of detecting speech/non-speech segments in these files. To alleviat...

متن کامل

A noise-robust system for NIST 2012 speaker recognition evaluation

The National Institute of Standards and Technology (NIST) 2012 speaker recognition evaluation posed several new challenges including noisy data, varying test-sample length and number of enrollment samples, and a new metric. Target speakers were known during system development and could be used for model training and score normalization. For the evaluation, SRI International (SRI) submitted a sy...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Computer Speech & Language

دوره 28  شماره 

صفحات  -

تاریخ انتشار 2014